Summary

1. Dataset Description

Input variables:

Bank client data:

Output variable (desired target):

Define Global Constants

Import libraries

Below we import all the libraries that will be used throughout this coursework.

Auxiliary functions

Define the AdaBound algorithm class to use as one of the neural-network optimizers.
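For intuition, the core of the AdaBound update (Luo et al., 2019) is Adam with the per-coordinate step size clipped to dynamic bounds that converge toward a final learning rate. Below is a minimal numpy sketch of that update rule on a toy problem; it is an illustration of the idea, not the optimizer class used in this notebook, and all hyperparameter values are illustrative defaults.

```python
import numpy as np

def adabound_minimize(grad_fn, x0, lr=0.1, final_lr=0.1, betas=(0.9, 0.999),
                      gamma=1e-3, eps=1e-8, steps=200):
    """AdaBound-style update: Adam whose effective per-coordinate step size
    is clipped to bounds that tighten toward final_lr as t grows."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # first moment (EMA of gradients)
    v = np.zeros_like(x)  # second moment (EMA of squared gradients)
    b1, b2 = betas
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        # bias-corrected step size, as in Adam
        step_size = lr * np.sqrt(1 - b2 ** t) / (1 - b1 ** t)
        # dynamic bounds: very wide at t=1, shrinking toward final_lr
        lower = final_lr * (1 - 1 / (gamma * t + 1))
        upper = final_lr * (1 + 1 / (gamma * t))
        eta = np.clip(step_size / (np.sqrt(v) + eps), lower, upper)
        x = x - eta * m
    return x

# usage: minimize f(x) = x^2 (gradient 2x) starting from x0 = 5
x_min = adabound_minimize(lambda x: 2 * x, np.array([5.0]))
```

The clipping is what distinguishes AdaBound from plain Adam: early in training it behaves like Adam, and as the bounds tighten it behaves more like SGD with learning rate final_lr.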

3. Data Loading / Data Processing

Define global constants

Now we load the Bank Marketing dataset that will be used for the analysis and critical comparison of the SVM and MLP algorithms.

Load input file

Drop Duration variable

As stated in the dataset description, the duration column is not relevant for prediction purposes: its value is only known after the call has ended, by which point the outcome is already known, so it is highly correlated with the target variable. We therefore drop this column beforehand.
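The drop itself is a one-liner; the sketch below uses a small hypothetical frame standing in for the loaded dataset (column names follow the Bank Marketing schema).

```python
import pandas as pd

# hypothetical toy frame standing in for the loaded Bank Marketing data
df = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [120, 340, 15],  # known only after the call ends -> leaks the target
    "y": ["no", "yes", "no"],
})

# drop the leaky column; errors="ignore" keeps the cell re-runnable
df = df.drop(columns=["duration"], errors="ignore")
```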

Unique values count

We then count the unique values again after dropping the variable.

Columns basic information

Quantitative variables: age, campaign, pdays, previous, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed

Categorical variables: job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome

Exploratory Data Analysis

As seen above, we have a binary classification problem with highly class-imbalanced data, characterized by a ratio of approximately 1:9 between the minority ('yes') and majority ('no') classes. We should therefore evaluate applying a resampling method to obtain a better-balanced dataset.
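The class ratio can be checked directly from the target column; the series below is an illustrative stand-in with the same 1:9 proportion.

```python
import pandas as pd

# illustrative target column with a 1:9 'yes':'no' imbalance
y = pd.Series(["no"] * 9 + ["yes"] * 1)

counts = y.value_counts()                 # absolute counts per class
proportions = y.value_counts(normalize=True)  # relative frequencies
ratio = counts["yes"] / counts["no"]      # minority-to-majority ratio
```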

Missing values

We will investigate the dataset for missing values in the columns
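A sketch of the check, on a toy frame: besides true NaN values, this dataset encodes missing categorical values as the string 'unknown', so both are worth counting.

```python
import numpy as np
import pandas as pd

# toy frame: one NaN and one 'unknown' sentinel value
df = pd.DataFrame({
    "job": ["admin.", "unknown", "technician"],
    "age": [30, np.nan, 52],
})

nan_counts = df.isna().sum()          # true NaN missing values per column
unknown_counts = (df == "unknown").sum()  # 'unknown' sentinel strings per column
```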

We will split the data processing between quantitative and categorical (ordinal/nominal) variables

Quantitative variable analysis (Distribution of features vs classes)

First let's see what is the distribution of quantitative variables

Box plot distribution of quantitative variables

Ideally we would like to verify if there is any high correlation between the independent quantitative variables.

No high correlation (taking an absolute value above 0.5 as the threshold) was observed between the quantitative variables, so we will proceed with the analysis considering all the quantitative variables initially selected.
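A sketch of how such a check can be automated: build the correlation matrix, keep only the upper triangle (to avoid the diagonal and duplicate pairs), and list pairs above the 0.5 threshold. The data here is synthetic, with one pair deliberately made highly correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)),
                  columns=["age", "campaign", "previous"])
# force one highly correlated pair for illustration
df["campaign"] = df["age"] * 0.9 + rng.normal(scale=0.1, size=200)

corr = df.corr().abs()
# keep only the strict upper triangle so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.5]  # pairs exceeding the 0.5 threshold
```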

After analysing the quantitative variables we still have to investigate the remaining categorical variables.

Ordinal

Nominal

Categorical variables

We shall divide the analysis between ordinal and nominal variables and apply the necessary transformations to the training and test sets independently. Ordinal variables should be encoded in a way that preserves the scale between their values in a discrete space. For the nominal variables, we can instead create one column for each value they assume (one-hot encoding).

The contact variable, specifically, will be treated as ordinal: since it assumes only two values, a single encoded column avoids the redundancy of two one-hot columns.

Ordinal variables

Next we can encode the 'yes'/'no'/'unknown' values (of default, housing, and loan) and the 'failure'/'nonexistent'/'success' values (of poutcome) as ordinal numbers, because they represent scales of values in a discrete space, unlike other categorical variables whose category sets may grow. Moreover, we know the relationship between the categories.

Therefore, each value will contribute differently to the target variable. We will use the OrdinalEncoder method from sklearn and define the order of categories as follows: 'yes'/'failure', then 'unknown'/'nonexistent', and lastly 'no'/'success'; these will be mapped to integer values to be used subsequently as inputs to the neural network.
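A minimal sketch of that encoding with sklearn's OrdinalEncoder, on a toy frame with two of the ordinal columns (the category order passed in fixes the integer assignment, here 'yes'/'failure' → 0, 'unknown'/'nonexistent' → 1, 'no'/'success' → 2):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ordinal_cols = ["default", "poutcome"]  # illustrative subset of the ordinal variables
df = pd.DataFrame({
    "default": ["no", "unknown", "yes"],
    "poutcome": ["success", "nonexistent", "failure"],
})

# explicit category order per column fixes the integer mapping
encoder = OrdinalEncoder(categories=[["yes", "unknown", "no"],
                                     ["failure", "nonexistent", "success"]])
encoded = encoder.fit_transform(df[ordinal_cols])
```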

To accomplish this we will implement the transformation inside the transform_input function (below) and create a new dataframe containing only the ordinal variables to be added later to the rest of the variables.

Nominal variables

Let's investigate the values assumed by the 3 remaining variables (job/marital/education) to verify whether any of them has a distance relation between values and can be transformed into an ordinal column.

Split dataset

Before applying the transformation of categorical variables we should first split the data into training and test sets. One reason for this is that a rare category of some categorical variable may appear only in the unseen data (test set); fitting the encoders on the full dataset would then leak information about the test distribution into training. We therefore fit all transformations on the training set only.

As a best practice, we fit each encoder on the training set only and then apply it to both the training and test sets.
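A sketch of this fit-on-train, transform-both pattern with sklearn's OneHotEncoder; handle_unknown="ignore" makes a category seen only at test time (here the hypothetical value 'student') encode as an all-zeros row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"job": ["admin.", "technician", "admin."]})
test = pd.DataFrame({"job": ["technician", "student"]})  # 'student' unseen in training

# fit on the training set ONLY; unseen test categories become all-zero rows
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(train).toarray()
X_test = encoder.transform(test).toarray()
```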

Prepare data (Input variables / Target variables)

Now we can call the functions and have the resultant transformed sets

Encode training and test sets

Now we verify the dimensions of encoded arrays

4. Model Implementation

4.1. MLP Implementation

Define initial parameters

4.1.1. Network architecture

Define network architecture and create instance

Define Neural Network architecture

Visualize the Neural Network architecture

Network initial parameters

4.1.2. Skorch Classifier

Now we feed our model to the skorch classifier and define a skorch object with initial parameters before optimization.

First we should find the best set of class weights to balance the classes (using GridSearchCV)
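One common starting point before searching over weights is sklearn's 'balanced' heuristic, which weights each class inversely to its frequency; the sketch below uses an illustrative 1:9 imbalanced label vector, not the actual training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array(["no"] * 90 + ["yes"] * 10)  # illustrative 1:9 imbalance

# 'balanced' weight for class c = n_samples / (n_classes * count(c))
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array(["no", "yes"]), y=y_train)
```

A grid search can then explore weights in a neighbourhood around these values rather than over an arbitrary range.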

Define initial training function for the MLP NN architecture (The function returns the best model)

Save model selection estimator to the disk

Plot Training and Validation history

Accuracies per hyperparameter settings iteration

Plot Training vs Validation Learning Curves (Accuracy / Loss)

Save results from the RandomizedSearch into a dataframe for analysis

Save RandomizedSearch results into a dataframe for plotting the Parallel Plot

Save hyperparameter results to CSV file (Top 20)

Plot hyperparameter search results

Using ROC

Using accuracy

Plot table of the hyperparameters and metrics of roc_auc, pr_auc and accuracy.

Analyse Prediction Results and compute metrics

Now we want to analyse how the model performs against the training data:

From the above definitions we can state that false positives (clients predicted to subscribe who do not) generate costs with no return, whereas false negatives represent only missed opportunities. It is therefore more important to keep the number of false positives low.
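As a quick sketch of how these counts are obtained, the confusion matrix can be unpacked directly into TN/FP/FN/TP on a hypothetical set of labels (1 = 'yes'):

```python
from sklearn.metrics import confusion_matrix

# hypothetical true labels and predictions (1 = 'yes', 0 = 'no')
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

# sklearn orders the flattened 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```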

F1-Score: the harmonic mean of precision and recall. Support: the number of examples of each class used to compute the per-class metrics.

ROC-AUC Curve

Precision-Recall Curve (PR-AUC)

Precision-Recall Area Under Curve (AUC) Score

Final model training

Retrain model with the entire training set (considering the initial training and validation data from 80% of the original dataset)

We will follow a strategy described in the Deep Learning book (Ian Goodfellow, Yoshua Bengio, and Aaron Courville):

Short version

We define a new instance of the Neural Network with the parameters found previously.

Now that we have retrained the model we can save the model that will be applied against the test set for the algorithm comparison

Save MLP best model

Save best hyperparameters to pkl file

Save final NN model, history and optimizer state

Load MLP best model for testing

Load best hyperparameters

Load saved model

Display Learning curves to see if overfitting or underfitting data

By observing the learning curves, we can tell whether the neural network overfitted or underfitted the data:

Overfit: the training loss curve is significantly lower than the validation loss curve.

Underfit: both the training and validation loss curves remain at a high loss.

Ideal: the training and validation loss curves converge to a low loss with only a minimal gap between them.
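These criteria can be captured in a small helper; the thresholds below (gap_tol, high_loss) are arbitrary illustrative values, not ones taken from this notebook.

```python
def diagnose(train_losses, valid_losses, gap_tol=0.1, high_loss=1.0):
    """Rough heuristic for the overfit/underfit criteria above.
    Thresholds are illustrative and problem-dependent."""
    final_train, final_valid = train_losses[-1], valid_losses[-1]
    if final_train > high_loss and final_valid > high_loss:
        return "underfit"   # both curves stuck at a high loss
    if final_valid - final_train > gap_tol:
        return "overfit"    # validation loss well above training loss
    return "ok"             # small gap, low loss

# usage on hypothetical loss histories
verdict = diagnose([0.9, 0.4, 0.1], [0.9, 0.5, 0.6])
```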

Precision vs Recall

Confusion Matrix of the best selected MLP model (Training with entire training data)

Test best chosen model

4.2. SVM Implementation

Linear SVM model (Linear kernel)

Before applying optimization we decided to apply a linear kernel to test validation set performance using the same dataset as the one used for training the neural network. We will use the same pipeline (imblearn) and run the sklearn implementation of SVC with a linear kernel against the training set.

Let's define an SVC instance with a linear kernel, the default value for the C regularization hyperparameter (1.0), and gamma='auto', which sets gamma to 1 / n_features (approximately 0.025 ≈ 1/40)
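A minimal sketch of that configuration on synthetic data (the real notebook fits the encoded Bank Marketing training set); note that sklearn's linear kernel does not actually use gamma, so the setting only matters if the kernel is later changed.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# synthetic stand-in for the encoded training data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = SVC(kernel="linear", C=1.0, gamma="auto")  # gamma ignored by the linear kernel
clf.fit(X, y)
train_acc = clf.score(X, y)  # training accuracy of the linear baseline
```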

Train SVM with linear kernel

Plot learning curves of Linear SVC (Using CV validation with K=5)

Systematic approach to test other kernels and hyperparameters

Let's first define the pipeline

We will use the same pipeline as before; the only differences are the hyperparameters and the heuristic of the model. Due to the large size of the training set, the SVM hyperparameter search can take a long time to complete: according to the scikit-learn documentation, the fit time scales at least quadratically with the number of samples and may be impractical beyond 10000 samples.

Due to the large size of the training set (32,950 observations), we opted to optimise using a stratified sample of 50% of the training set, which maintains the same class proportions. After that we apply standardisation, SMOTE, and undersampling (run only on the training folds), enclosed in a pipeline, to perform a RandomizedSearch cross-validation over SVM with several parameter configurations for the 3 kernels initially selected (linear, poly, rbf), as well as intervals for C and gamma.

Then we will run the same configuration using RandomizedSearch cross-validation and compare whether the results outperform GridSearch.

Finally, we will select the best parameters, retrain on the entire training set, and use the model object to predict on the original test set defined previously.

Split the training set (32,950 observations) into a stratified sample of 50% of its original size.
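A sketch of the stratified subsampling with train_test_split; the toy labels below mimic the 1:9 imbalance, and the check confirms the class proportions are preserved in the 50% sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # 1:9 imbalance, as in our data

# keep 50% of the training set while preserving class proportions
X_half, _, y_half, _ = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=42)
```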

New dataset

Now we can define the hyperparameter space to be searched first. We opted not to use the poly kernel: its discrete degree hyperparameter is better suited to Grid Search, and it lacks suitable hyperparameters to tune in a Random Search (our case). Therefore, we propose to optimize the algorithm using only the two other kernels (linear and RBF).

Randomized Search of Hyperparameters for SVM algorithm

Hyperparameters will be set in the following interval:
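The actual intervals are defined in the notebook code; as an illustration of the mechanism, a randomized search over log-uniform C and gamma ranges (the bounds below are assumptions, not the notebook's values) on synthetic data might look like:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# small synthetic stand-in for the stratified training sample
X, y = make_classification(n_samples=150, random_state=0)

# hypothetical intervals: log-uniform sampling suits scale hyperparameters
param_dist = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=8, cv=3,
                            scoring="roc_auc", random_state=0)
search.fit(X, y)
best = search.best_params_  # best sampled configuration by CV ROC-AUC
```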

Generate scores of predictions for the best hyperparameters

Plot SVM Hyperparameter Results

Based on ROC Curve

Based on Accuracy

Analysis of C and Gamma Parameters (Using Accuracy and ROC-AUC)

Train the final model on the entire training dataset

We will analyse the Precision-Recall curve over the training set using the best model (Training over the entire training data)

Now we want to analyse how the model performs against the training data:

From the above definitions we can state that false positives (clients predicted to subscribe who do not) generate costs with no return, whereas false negatives represent only missed opportunities. It is therefore more important to keep the number of false positives low.

F1-Score: the harmonic mean of precision and recall. Support: the number of examples of each class used to compute the per-class metrics.

ROC-AUC Curve

Precision-Recall Curve (PR-AUC)

Confusion Matrix of the SVM model selection (Training with cross-validation over 50% of the training sample)

Confusion Matrix of the best selected SVM model (Training with entire training data)

Save final model (classifier) for prediction on the test set

RandomizedSearch implementation

Running in the entire training set and Predicting on the test set

Now let's apply a kernel transformation: the SVM decision function is based on a similarity relationship (the kernel) computed between each pair of points.

Loading Best Models for Inference

Saving test set for prediction on the saved models

Saving MLP best learning model

Saving SVM best learning model

7. Models reproducibility

Example for MLP reproducibility using 64 iterations for RandomSearch (Uncomment for training)

Example for SVM reproducibility using 32 iterations for RandomSearch (Uncomment for training)